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Abstract 

For several decades, image restoration remains an ac¬ 
tive research topic in low-level computer vision and hence 
new approaches are constantly emerging. However, many 
recently proposed algorithms achieve state-of-the-art per¬ 
formance only at the expense of very high computation 
time, which clearly limits their practical relevance. In this 
work, we propose a simple but effective approach with both 
high computational efficiency and high restoration quality. 
We extend conventional nonlinear reaction diffusion mod¬ 
els by several parametrized linear filters as well as several 
parametrized influence functions. We propose to train the 
parameters of the filters and the influence functions through 
a loss based approach. Experiments show that our trained 
nonlinear reaction diffusion models largely benefit from the 
training of the parameters and finally lead to the best re¬ 
ported performance on common test datasets for image 
restoration. Due to their structural simplicity, our trained 
models are highly efficient and are also well-suited for par¬ 
allel computation on GPUs. 

1. Introduction 

Image restoration is the process of estimating uncor¬ 
rupted images from noisy or blurred ones. It is one of 
the most fundamental operation in image processing, video 
processing, and low-level computer vision. There exists 
a huge amount of literature addressing the topic of image 
restoration problems, see for example [30] for a survey. 
Broadly speaking, most state-of-the-art techniques mainly 
concentrate on achieving utmost image restoration qual¬ 
ity, with little consideration on the computational efficiency 
[43, 29, 18]. However, there are two notable exceptions, 
BM3D [10] and the recently proposed Cascade of Shrink¬ 
age Fields (CSF) [37] model, which simultaneously offer 
high efficiency and high image restoration quality. 

It is well-known that BM3D is a highly engineered 
method, specialized for Gaussian noise. Moreover, it in¬ 
volves a block matching process, which is challenging for 
parallel computation on GPUs, alluding to the fact that it 
is not straightforward to accelerate BM3D algorithm on 
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Figure 1. The figure shows four characteristic influence func¬ 
tions (left plot in each subfigure) together with their correspond¬ 
ing penalty functions (right plot in each subfigure), learned by our 
proposed method. A major finding in this paper is that our learned 
penalty functions significantly differ from the usual penalty func¬ 
tions adopted in partial differential equations and energy mini¬ 
mization methods. In contrast to their usual robust smoothing 
properties which is caused by a single minimum around zero, most 
of our learned functions have multiple minima different from zero 
and hence are able to enhance certain image structures. See Sec. 
4.1 for more information. 


parallel architectures. In contrast, the recently proposed 
CSF model offers high levels of parallelism, making it well 
suited for GPU implementation, thus owning high compu¬ 
tational efficiency. 

Among the approaches to tackle the problem of image 
restoration, nonlinear anisotropic diffusion [33, 40] defines 
a class of efficient approaches, as each diffusion step merely 
contains the convolution operation with a few linear filters. 
A nonlinear diffusion process usually corresponds to cer¬ 
tain Partial Differential Equation (PDE) formulation. How¬ 
ever, up to now, the image restoration quality of diffusion 
based approaches is still far away from the state-of-the-art, 
although with many improvements [20, 12, 34, 19]. 
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We give a brief review of nonlinear diffusion based ap¬ 
proaches and then introduce our proposed diffusion model. 
In the seminal work [33], Perona and Malik (P-M) demon¬ 
strated that nonlinear diffusion models yield very impres¬ 
sive results for image processing. This has given rise to 
many revised models with various formulations. A notable 
variant is the so-called biased anisotropic diffusion (also 
known as reaction diffusion) proposed by Nordstrom [31], 
which introduces a bias term (forcing term) to free the user 
from the difficulty of specifying an appropriate stopping 
time for the P-M diffusion process. This additional term 
reacts against the strict smoothing effect of the pure P-M 
diffusion, therefore resulting in a nontrivial steady-state. 

Tsiotsios et al. [39] discussed the choice of some cru¬ 
cial parameters in the P-M model, such as the diffusivity 
function, the gradient threshold parameter and the stopping 
time of the iterative process. Some works consider mod¬ 
ification to the diffusion term or the reaction term for the 
reaction diffusion model [14, 9, 31, 34, 1], e.g ., Acton et 
al. [1] and Plonkna et al. [34] exploited a more complicated 
reaction term to enhance oriented textures; [38, 3] proposed 
to replace the ordinary diffusion term with a flow equation 
based on mean curvature. Gilboa et al. [16] proposed a for¬ 
ward and backward diffusion process, which incorporates 
explicit inverse diffusion with negative diffusivity coeffi¬ 
cient by carefully choosing the diffusivity function. The 
resultant diffusion processes can adaptively switch between 
forward and backward diffusion process. In a latter work 
[41], the theoretical foundations for discrete forward-and- 
backward diffusion filtering were investigated. Researchers 
also propose to exploit higher-order nonlinear diffusion fil¬ 
tering, which involves larger linear filters, e.g., fourth-order 
diffusion models [20, 12, 1 ]. Meanwhile, theoretical prop¬ 
erties about the stability and local feature enhancement of 
higher-order nonlinear diffusion filtering are established in 
[1 ]• 

It should be noted that all the above mentioned diffusion 
processes are handcrafted models. It is a generally difficult 
task to design a good-performing PDE for a specific image 
processing problem because good insights into this prob¬ 
lem and a deep understanding of the behavior of the PDEs 
are usually required. Therefore, some researcher propose 
to learn PDEs from training data via an optimal control ap¬ 
proach [28]. Unfortunately, at present [2£ ] is the sole pre¬ 
vious work we can find in this direction. The basic idea of 
our approach is the same as [28], but we go much further 
and our proposed model is much more expressive. 

1.1. Motivation and Contributions 

In this paper we focus on nonlinear diffusion process due 
to its high efficiency and propose a trainable nonlinear dif¬ 
fusion model, which is parameterized by the linear filters 
and the influence functions. The trained diffusion model 


contains many special influence functions (see Fig. 1 for an 
illustration), which greatly differ from usual influence func¬ 
tions employed in conventional diffusion models. It turns 
out that the trained diffusion processes can lead to effective 
image restoration with state-of-the-art performance, while 
preserve the property of high efficiency of diffusion based 
approaches. At present, we are not aware of any previous 
works that simultaneously optimize the linear filters and in¬ 
fluence functions of a nonlinear diffusion process 1 . 

Our proposed nonlinear diffusion process has several re¬ 
markable benefits as follows: 

1) It is conceptually simple as it is just a time-dynamic 
nonlinear reaction diffusion model with trained filters 
and influence functions; 

2) It has broad applicability to a variety of image restora¬ 
tion problems. In principle, all existing diffusion based 
models can be revisited with appropriate training; 

3) It yields excellent results for several tasks in image 
restoration, including Gaussian image denoising, and 
JPEG deblocking; 

4) It is computationally very efficient and well suited for 
parallel computation on GPUs. 

2. Proposed reaction diffusion process 

We start with conventional nonlinear diffusion processes, 
then propose a training based reaction diffusion model for 
image restoration. Finally we show the relations between 
the proposed model and existing image restoration models. 

2.1. Perona and Malik diffusion model 

In the whole paper, we stick to the fully discrete set¬ 
ting, where images are represented as column vectors, i.e., 
u e R n . Therefore, the discrete version of the well-known 
Perona-Malik type nonlinear diffusion process [33] can be 
formulated as the following discrete PDE with an explicit 
finite difference scheme 

- = ~ £ ViA(u t )V iUt = - ]T V^(ViUt), 
i={x,y} i={x,y} 

where matrices and \7 y G M iVxAr are finite differ¬ 
ence approximation of the gradient operators in x-direction 
and ^/-direction, respectively and At denotes the time step. 

A (u t ) G R NxN is defined as a diagonal matrix 

a( u t ) = diag (g (Y(V x u t )2 + ( v y u t)p)) p=1 N , 

where function g is known as edge-stopping function ] 
or diffusivity function [40], a typical g function given as 

1 Even though the linear filters and penalty functions in the image prior 
model [35, 7] can be trained simultaneously, the penalty function is opti¬ 
mized only in the sense that the weight a of certain fixed function (e.g., 

a • log(l + z 2 )) can be tuned. Our approach can exploit much more gener¬ 
alized penalty functions (actually arbitrary functions), which is intractable 
in those previous models. 





g(z) = 1/(1 + z 2 ). If ignoring the coupled relation be¬ 
tween V x u and X7 y u, the P-M model can be also writ¬ 
ten as the second formula on the right side in (1), where 
<fi(\7iu) = ((f)(\7iu) i,--- ,0 (V^)at) T G with func¬ 
tion <fi(z) = zg(z ), known as influence function [ 4 ] or flux 
function [40]. In the upcoming subsection, we will stick to 
this decoupled formulation. 


2.2. Proposed nonlinear diffusion model 


Clearly, the matrix-vector product, V x u can be inter¬ 
preted as a 2D convolution of u with the linear filter k x = 
[—1,1] (V y corresponds to the linear filter k y = [— 1 ,1] T ). 
Intuitively, in order to improve the capacity of the diffusion 
model, we can employ more filters of larger kernel size, in 
contrast to previous works that typically involve few filters 
with relatively small kernel size. We can additionally con¬ 
sider different influence functions for different filters, rather 
than an unique one. Moreover, the parameters of each itera¬ 
tion can vary across iterations. Taking the reaction term into 
account, our proposed nonlinear reaction diffusion model is 
formulated as 


ut ~ u t -1 

At 


N k 


= ~Y]K\ - V’K-l,/n), 

z ' \ ___ ✓ 


i=1 


reaction term 


diffusion term 


where 7Q e M iVxAr is a highly sparse matrix, implemented 
as 2D convolution of the image u with the filter kernel ki , 
i.e., KiU ki * u 9 Ki is a set of linear filters and Nk is 
the number of filters. Function fa operates point-wise to 
the filter response KiU. In practice, we set At = 1, as we 
can freely scale the formula on the right side. Note that in 
our proposed diffusion model, the influence functions are 
adjustable and can be different from each other. 

The specific formulation for the reaction term ^(u) de¬ 
pends on applications. For classical image restoration prob¬ 
lems, such as Gaussian denoising, image deblurring, image 
super resolution and image inpainting, we can set the reac¬ 
tion term to be the gradient of a data term, i.e. ip(u) = 
Y/ u V{u). For example, if V(u, f n ) = ||| An - f n |||, 
ip(u) = XA t (Au — f n ), where f n is the degraded image, 
A is the associated linear operator, and A is related to the 
strength of the reaction term. In the case of Gaussian de¬ 
noising, A is the identity matrix. 

In our work, instead of making use of the well-chosen 
filters and influence functions, we train the nonlinear diffu¬ 
sion process for specific image restoration problem, includ¬ 
ing both the linear filters and the influence functions. As 
the diffusion process is an iterative approach, typically we 
run it for certain iterations. In order to make our proposed 
diffusion process more flexible, we train the parameters of 
the diffusion model for each single iteration. Finally, we 
arrive at a diffusion process which merely involves several 
iterations (referred to as stages). 


2.3. Relations to existing image restoration models 

Previous works [ 36 , 31 ] show that in the nonlinear dif¬ 
fusion framework, there exist natural relations between re¬ 
action diffusion and regularization based energy functional. 
First of all, we can interpret (2) as one gradient descent step 
at u t -i of a certain energy functional given by 
N k 

E{u,f n ) = Y J K i {u)+V{u,f n ), ( 3 ) 

2=1 

where 7 Zi(u) = are the regularizers and 

the functions p\ are the so-called penalty functions. Note 
that p'(z) = faz). Since the parameters {Kj,p\} vary 
across the stages, (3) is a dynamic energy functional, which 
changes at each iteration. 

In the case of fixed {Kj, p\} across the stages t, it is ob¬ 
vious that functional (3) is exactly the fields of experts im¬ 
age prior regularized variational model for image restora¬ 
tion [35, 8, 7]. In our work, we do not exactly solve this 
minimization problem anymore, but in contrast, we run the 
gradient descent step for several stages, and each gradient 
descent step is optimized by training. 

In a very recent work [37], Schmidt et al. exploited an 
additive form of half-quadratic optimization to solve the 
same problem (3), which finally leads to a fast and effective 
image restoration model called cascade of shrinkage fields 
(CSF). The CSF model makes an assumption that the data 
term in (3) is quadratic and the operator A can be inter¬ 
preted as a convolution operation, such that the correspond¬ 
ing subproblem has fast closed-form solution based on dis¬ 
crete Fourier transform (DFT). This restrains its applicabil¬ 
ity to many other problems such as image super resolution. 
However, our proposed diffusion model does not have this 
restriction on the data term. In principle, any smooth data 
term is appropriate. Moreover, as shown in the following 
sections, we can even handle the case of non-smooth data 
term. 

There exist some previous works [ 2 , 13 ], also trying to 
train an optimized gradient descent algorithm for the en¬ 
ergy functional similar to ( 3 ). In their works, the Gaus¬ 
sian denoising problem is considered, and the trained gra¬ 
dient descent algorithm typically involves less than 10 iter¬ 
ations. However, their model is much more constrained, in 
the sense that, they exploited the same filters for each gradi¬ 
ent descent step. More importantly, the influence function 
in their model is fixed to be a unique one. This clearly re¬ 
stricts the model performance, as demonstrated in Sec. 4.1. 

There are also few preliminary works, e.g ., [28] to go 
beyond traditional PDEs of the form (1), and propose to 
learn optimal PDEs for image restoration via optimal con¬ 
trol. However, the investigated PDE model in [28] is too 
simple to generate a promising performance, as they only 
optimize the linear combination coefficients of a few prede¬ 
fined terms, which depend on selected derivative filters. 
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Figure 2. The architecture of our proposed diffusion model. Note 
that the additional convolution step with the rotated kernels fe ( cf 
Equ. 7) does not appear in conventional feed-forward CNs. Our 
model can be interpreted as a CN with a feedback step, which 
makes it different from conventional feed-forward networks. Due 
to the feedback step, it can be categorized into recurrent neural 
networks [17]. 

The proposed diffusion model also bears an interesting 
link to the convolutional networks (CNs) employed for im¬ 
age restoration problems [23]. One can see that each it¬ 
eration (stage) of our proposed diffusion process involves 
the convolution operation with a set of linear filters, and 
thus it can be treated as a convolutional network. The ar¬ 
chitecture of our proposed network is shown in Figure 2, 
where one can see that it is not a pure feed-forward network 
any more, because it has a feedback step. Therefore, the 
structure of our CN model is different from conventional 
feed-forward networks. Due to this feedback step, it can be 
categorized into recurrent neural networks [17]. Moreover, 
the nonlinearity ( e.g ., influence functions in the context of 
nonlinear diffusion) in our proposed network are trainable. 
However, conventional CNs make use of fixed activation 
function, e.g., ReLU functions or sigmoid functions. 

3. Learning 

In this paper, we train our models for two representative 
image restoration problems: ( 1 ) denoising of images cor¬ 
rupted by Gaussian noise and (2) JPEG blocking artifacts 
reduction, which is formulated as a non-smooth problem. 
We use a loss minimization scheme to learn the model pa¬ 
rameters 0 t = {A t ,(j)\,k t i } for each stage t of the diffusion 
process, given S training samples {fn\u^} f =1 , where 

fn^ is a noisy input and is the corresponding ground 
truth clean image. 

We firstly consider a greedy training strategy to train the 
diffusion processes stage-by-stage, i.e., at stage t, we mini¬ 
mize the cost function 

£(®t) = J2 £ ( U t S) ’ U 9t') ’ (4) 

s =1 
(s) 

where u\ ’ is the output of stage t of the diffusion process. 
We prefer the usual quadratic loss function to the negative 
PSNR used in [37], because the latter one imposes more 


weights on those samples with relatively smaller cost, and 
thus leads to slightly inferior results in practice. The loss 
function is given as 

K u t S \ ui gt) = ^ll M t S) - u{ gt\\l • ( 5 ) 

Parameterizing the influence functions 4>\\ We parame¬ 
terize the influence function via standard radial basis func¬ 
tions (RBFs), i.e., each function </> is represented as a 
weighted linear combination of a family of RBFs as follows 



where ip represents different RBFs. In this paper, we exploit 
RBFs with equidistant centers fij and unified scaling 7 j. We 
investigate two typical RBFs [22]: (1) Gaussian radial basis 
and ( 2 ) triangular-shaped radial basis. 

In general, the Gaussian RBF can provide better approx¬ 
imation for generally smooth function than the triangular¬ 
shaped RBF with the same number of basis functions. How¬ 
ever, the triangular-shaped RBF based function parameteri¬ 
zation has the advantage of computational efficiency. More 
details can be found in the supplemental material . In our 
work, we consider both function parameterization methods, 
but only present the results achieved based on the Gaussian 
RBF. 

Training for denoising: According to the diffusion equa¬ 
tion ( 2 ), for image denoising, the output of stage t is given 
as 


' N k 


U t 


— 'tLt —1 ( ^ ^ * 'U't—l) 3“ A {lit— 1 fn) 


\i=l 


where we explicitly use a convolution kernel ki (obtained 
by rotating the kernel ki 180 degrees) to replace the Kj for 
the sake of model simplicity 2 . 

Training for deblocking: Motivated by [5], we consider 
a new variational model for JPEG deblocking based on the 
FoE image prior model 


N k 

argminiS(iz) = pi(fc* * u) + Xq (Du) , (8) 

u zJ 

i =1 

where Xq is a indicator function over the set Q (quantiza¬ 
tion constraint set). In JPEG compression, information loss 
happens in the quantization step, where all possible values 
in the range [d — 0.5, d + 0.5] (d is an integer) are quan¬ 
tized to a single number d. Given a compressed data, we 
only know d. Therefore, all possible values in the interval 

2 We use the symmetric boundary condition in our work. In this case, 
Kj can be interpreted as the convolution kernel ki only in the central 
region. Therefore, we actually slightly modify the original model. 

























[d — 0.5, d + 0.5] define a convex set Q which is a box con¬ 
straint. The sparse matrix D E 'R NxN denotes the block 
DCT transform. We refer to [5] for more details. 

We derive the diffusion process w.r.t the variational 
model (8) using the proximal gradient method [32], which 
reads as 

Ut = D t proj Q (d (u t -1 - > 

(9) 

where projg(-) denotes the orthogonal projection onto Q. 
More details can be found in the supplemental material. 
Gradients: We minimize (4) with commonly used gradi¬ 
ent based L-BFGS algorithm [27]. The gradient of the loss 
function at stage t w.r.t the model parameters 0 t is com¬ 
puted using standard chain rule, given as 

d£(ut,Ugt) _ diif d£(ut : Ugt) (Iff) 

dG t ~dQt' d^t ’ 1 * 

where d ^ u ^at) — Ut _ Ugt j s directly derived from (5), 
is computed from (7) for the training of denoising task 
or (9) for the deblocking training, respectively. We do not 
present the derivatives for specific model parameters due to 
space limitation. All derivatives can be found in the supple¬ 
mental material. 

Joint training: In (4), each stage is trained greedily such 
that the output of each stage is optimized according to the 
loss function, regardless of the total stages T used in the 
diffusion process. A better strategy would be to jointly train 
all the stages simultaneously. The joint training task is for¬ 
mulated as 

s 

- ,t) = y ^£(u^\u[f ), (11) 

3 = 1 

where the loss function only depends on ut (the output of 
the final stage T). The gradients of the loss function w.r.t 
©t is given as 

d£(u T , Ugt) _ dut_ du t +i d£(u T ,u gt ) 

dQ t 90 t du t du T 

which is the standard back-propagation technique widely 
used in the neural networks learning [2: ]. Compared with 
the greedy training, we additionally need to calculate . 
All the derivations can be found in the supplemental mate¬ 
rial . 

4. Experiments 

We used the same 400 training images as [37], and 
cropped a 180 x 180 region from each image, resulting in a 
total of 400 training samples of size 180 x 180, i.e., roughly 
13 million pixels. 

We trained the proposed diffusion process with at most 8 
stages to observe its saturation behavior after some stages. 


We first greedily trained T stages of our model with spe¬ 
cific model capacity, then conducted a joint training for the 
parameters of the whole T stages. 

In our work, we mainly considered two trained reaction 
diffusion (TRD) models. 

TRD^ X 5 ,Fully trained model with 24 filters of size 5x5, 
TRD 7x7 ,Fully trained model with 48 filters of size 7x7, 

where TRD^ xm denotes a nonlinear diffusion process of 
stage T with filters of size m x m. The filters number is 
m 2 — 1, if not specified. 

Note that the calculation of the gradients of the loss func¬ 
tion in (10) can be accomplished with convolution tech¬ 
nique efficiently, even with a simple Matlab implementa¬ 
tion. The training time varies greatly for different config¬ 
urations. Important factors include (1) model capacity, (2) 
number of training samples, (3) number of iterations taken 
by the L-BFGS, and (4) number of Gaussian RBF kernels 
used for function approximation. We report below the most 
time consuming cases. 

In training, computing the gradients |j| with respect 
to the parameters of one stage for 400 images of size 
180 x 180 takes about 35s (TRD 5x5 ), 75s (TRD 7x7 ) or 165s 
(TRDgxg) with Matlab implementation on a server with 
CPUs: Intel(R) Xeon E5-2680 @ 2.80GHz (eight parallel 
threads, 63 Gaussian RBF kernels for the influence function 
parameterization). We typically run 200 L-BFGS iterations 
for optimization. Therefore, the total training time, e.g ., for 
the TRD 7x7 model is about 5 x (200 x 75)/3600 = 20.Sh. 
Code for learning and inference is available on the authors' 
homepage www. GPU4Vision . org. 

4.1. Image denoising experiments 

We started with the training model of TRD^ x5 . We first 
considered the greedy scheme to train a diffusion process up 
to 8 stages (i.e., T < 8), in order to observe the asymptotic 
behavior of the diffusion process. After the greedy training 
was completed, we conducted joint training for a diffusion 
model of certain stages (e.g., T = 5), by simultaneously 
tuning the parameters in all stages. 

We initialized the joint training with the parameters ob¬ 
tained from greedy training, as this is guaranteed not to de¬ 
crease the training performance. In previous work [37], it is 
shown that joint training a model with filters of size 5x5 
or larger hardly makes a difference relative to the result ob¬ 
tained by the greedy training. However, in our work we 
observed that joint training always improves the result of 
greedy training. 

Note that for the models trained in the greedy manner, 
we can stop the inference at any stage, as its output of each 
stage is optimized. However, for the jointly trained models, 
we have to run T stages, as in this case only the output of 
the T th stage is optimized. 









Method 

(7 

St. 

(7 = 

15 

15 

25 

TRDsxs 

TRD 7 X 7 

BM3D 

31.08 

28.56 

2 

31.14 

31.30 

LSSC 

31.27 

28.70 

5 

31.30 

31.42 

EPLL 

31.19 

28.68 

8 

31.34 

31.43 

opt-MRF 

31.18 

28.66 


(7 = 

25 

RTFs 

- 

28.75 


TRD 5 x 5 

TRD 7 X 7 

WNNM 

31.37 

28.83 

2 

28.58 

28.77 

csf^ x5 

31.14 

28.60 

5 

28.78 

28.92 

CS¥ 5 7x7 

31.24 

28.72 

8 

28.83 

28.95 


Table 1. Average PSNR (dB) on 68 images from [3 ] for image 
denoising with a — 15, 25. 


We first trained our diffusion models for the Gaussian de¬ 
noising problem with standard deviation a = 25. The noisy 
training images were generated by adding synthetic Gaus¬ 
sian noise with a = 25 to the clean images. Once we have 
trained a diffusion model, we evaluated its performance on 
a standard test dataset of 68 natural images . 3 

We present the final results of the joint training in Ta¬ 
ble 1 , together with a selection of recent state-of-the-art de¬ 
noising algorithms, namely BM3D [10], LSSC [29], EPLL 
[43], opt-MRF [7], RTF 5 model [24] and two very recent 
methods: the CSF model [37] and WNNM [15 ]. We down¬ 
loaded these algorithms from the corresponding author’s 
homepage, and used them as is. 

Concerning the performance of our TRD ^ x5 models, we 
find that joint training usually leads to an improvement of 
about 0.1 dB in the cases of T > 5. From Table 1, one 
can see that (1) the performance of the TRD ^ x5 model sat¬ 
urates after stage 5, i.e., in practice, 5 stages are typically 
enough; ( 2 ) our TRDs x5 model has achieved significant im¬ 
provement (28.78 vs. 28.60), compared to a similar model 
CSF 5 X5 , which has the same model capacity and (3) more¬ 
over, our TRD 5 X5 model is on par with so far the best- 
reported algorithm - WNNM. 

When comparing with some closely related models such 
as the FoE prior based variational model [7], the FoE de¬ 
rived CSF model [3 ] and convolutional networks (CNs) 
[2 ], our trained models can provide significantly superior 
performance. Therefore, a natural question arises: what is 
the critical factor in the effectiveness of the trained diffusion 
models? There are actually two main aspects in our training 
model: ( 1 ) the linear filters and ( 2 ) the influence functions. 
In order to have a better understanding of the trained mod¬ 
els, we went through a series of experiments to investigate 
the impact of these two aspects. 

Analysis of the proposed diffusion process: Concentrat¬ 
ing on the model capacity of 24 filters of size 5 x 5, we con¬ 
sidered the training of a diffusion process with 10 steps, i.e., 
T = 10 for the Gaussian denoising of noise level a = 25. 
We exploited two main classes of configurations: (A) the 

3 The test images are strictly separate from the training datasets. 


parameters of every stage are the same and (B) every diffu¬ 
sion stage is different from each other. In both configura¬ 
tions, we consider two cases: (I) only train the linear filters 
with fixed influence function 4>(z) = 2z/(l + z 2 ) and (II) 
simultaneously train the filters and influence functions. 

Based on the same training dataset and test dataset, we 
obtained the following results: (A.I) every diffusion step is 
the same, and only the filters are optimized with fixed in¬ 
fluence function. This is a similar configuration to previous 
works [2, 13]. The trained model achieves a test perfor¬ 
mance of 28.47dB. (A.II) with additional tuning of the in¬ 
fluence functions, the resulting performance is boosted to 
28.60dB. (B.I) every diffusion step can be different, but 
only train the linear filters with fixed influence function. 
The corresponding model obtains the result of 28.56dB, 
which is equivalent to the variational model [7] with the 
same model capacity. Finally (B.II) with additional opti¬ 
mization of the influence functions, the trained model leads 
to a significant improvement with the result of 28.86dB. 

The analysis experiments demonstrate that without the 
training of the influence functions, there is no chance to 
achieve significant improvements over previous works, no 
matter how hard we tune the linear filters. Therefore, we be¬ 
lieve that the additional freedom to tune the influence func¬ 
tions is the critical factor of our proposed training model. 
After having a closer look at the learned influence functions 
of the TRD 5 X5 model, these functions reinforce our argu¬ 
ment. 

Learned influence functions: The form of 120 learned 
penalty functions p 4 in the TRD^ model can be divided 
into four classes (see the corresponding subfigures in Fig¬ 
ure 1 ): 

(a) Truncated convex penalty functions with low values 
around zero to encourage smoothness. 

(b) Negative Mexican hat functions, which have a local 
minimum at zero and two symmetric local maxima. 

(c) Truncated concave functions with smaller values at the 
two tails. 

(d) Double-well functions, which have a local maximum 
(not a minimum any more) at zero and two symmetric 
local minima. 

At first glance, the learned penalty functions (except (a)) 
differ significantly from the usually adopted penalty func¬ 
tions used in PDE and energy minimization methods. How¬ 
ever, it turns out that they have a clear meaning for image 
regularization. 

Regarding the penalty function (b), there are two criti¬ 
cal points (indicated by red triangles). When the magnitude 
of the filter response is relatively small (i.e., less than the 
critical points), probably it is stimulated by the noise and 

4 The penalty function p(z) is integrated from the influence function 
4>(z) according to the relation 4>(z) = p'(z) 
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Figure 3. Trained filters (in the first and last stage) of the TRD 7x7 
model for the noise level a — 25. We can find first, second and 
higher-order derivative filters, as well as rotated derivative filters 
along different directions. These filters are effective for image 
structure detection, such as image edge and texture. 

therefore the penalty function encourages smoothing oper¬ 
ation as it has a local minimum at zero. However, once the 
magnitude of the filter response is large enough (i.e., across 
the critical points), the corresponding local patch probably 
contains a real image edge or certain structure. In this case, 
the penalty function encourages to increase the magnitude 
of the filter response, alluding to an image sharpening op¬ 
eration. Therefore, the diffusion process controlled by the 
influence function (b), can adaptively switch between im¬ 
age smoothing (forward diffusion) and sharpening (back¬ 
ward diffusion). We find that the learned influence function 
(b) is closely similar to an elaborately designed function in 
a previous work [16], which leads to an adaptive forward- 
and-backward diffusion process. 

A similar penalty function to the learned function (c) 
with a concave shape is also observed in previous work 
on image prior learning [42]. This penalty function also 
encourages to sharpen the image edges. Concerning the 
learned penalty function (d), as it has local minima at two 
specific points, it prefers specific image structure, implying 
that it helps to form certain image structure. We also find 
that this penalty function is exactly the type of bimodal ex¬ 
pert functions for texture synthesis employed in [21]. 

Now it is clear that the diffusion process involving the 
learned influence functions does not perform pure image 
smoothing any more for image processing. In contrast, it 
leads to a diffusion process for adaptive image smoothing 
and sharpening, distinguishing itself from previous com¬ 
monly used image regularization techniques. 

Influence of initialization: Our training model is also a 
deep model with many stages (layers). It is well-known that 


Method 

256' J 

512 J 

1024 2 

2048 2 

3072 2 

BM3D [10] 

1.1 

4.0 

17 

76.4 

176.0 

CSF^ x7 [37] 

3.27 

11.6 

40.82 

151.2 

494.8 

WNNM [If ] 

122.9 

532.9 

2094.6 

- 

- 


0.51 

1.53 

5.48 

24.97 

53.3 

TRDLs 

0.43 

0.78 

2.25 

8.01 

21.6 


0.005 

0.015 

0.054 

0.18 

0.39 


1.21 

3.72 

14.0 

62.2 

135.9 

trd 5 Tx7 

0.56 

1.17 

3.64 

13.01 

30.1 


0.01 

0.032 

0.116 

0.40 

0.87 


Table 2. Run time comparison for image denoising (in seconds) 
with different implementations. (1) The run time results with 
gray background are evaluated with the single-threaded imple¬ 
mentation on Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz; (2) 
the blue colored run times are obtained with multi-threaded com¬ 
putation using Matlab parfor on the above CPUs; (3) the run time 
results colored in red are executed on a NVIDIA GeForce GTX 
780Ti GPU. We do not count the memory transfer time between 
CPU/GPU for the GPU implementation (if counted, the run time 
will nearly double) 

deep models are usually sensitive to initialization. However, 
our training model is not very sensitive to initialization. We 
have training experiments with fully random initializations 
for (1) greedy training. Using fully random initial parame¬ 
ters in range [—0.5, 0.5], the trained models lead to a devi¬ 
ation within O.OldB in the test phase and (2) joint training. 
Fully random initializations lead to models with inferior re¬ 
sults, e.g., TRD 5 X5 (28.61 vv. 28.78). However, a plain ini¬ 
tialization (all stages with DCT filters, influence function 
ct>(z ) = 2;z/(l + z 2 )) works almost the same, e.g.,TRD 5 X5 , 
(28.75 vs. 28.78) and TRD^ x7 , (28.91 vj. 28.92). 

Training for other configurations: In order to investigate 
the influence of the model capacity, we increase the filter 
size to 7 x 7 and 9x9. We find that increasing the filter 
size from 5 x 5 to 7 x 7 brings a significant improvement of 
0.14dB ( TRD 7x7 vv. TRD 5 X5 ) as show in Table 1. How¬ 
ever, if we further increase the filter size to 9 x 9, the result¬ 
ing TRDg x9 leads to a performance of 28.96dB (a slight 
improvement of 0.05dB relative to the TRD 7x7 model). In 
practice, we prefer the TRD 7x7 model as it provides the 
best trade-off between performance and computation time. 
Fig. 3 shows the trained filters of the TRD 7x7 model in the 
first and last stage. 

We also trained diffusion models for the noise level of 
<7 = 15, and the test performance is shown in Table 1. In ex¬ 
periments, we observed that joint training can always gain 
an improvement of about 0.1 dB over the greedy training for 
the cases of T > 5. 

From Table 1, one can see that for both noise levels, 
the resulting TRD 7x7 model achieves the highest average 
PSNR. The TRD 7x7 model outperforms the benchmark 
BM3D method by 0.35dB in average. This is a notable im- 






Q 

JPEG 

decoder 

TGV 

[5] 

Dic- 

SR[6] 

SADCT 

[15] 

RTF[ 4] 
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10 

26.59 

26.96 
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29.03 

29.46 
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30.06 
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30.13 

30.67 

31.14 

31.41 


Table 3. JPEG deblocking results for natural images, reported with 
average PSNR values. 

provement as few methods can surpass BM3D more than 
0.3dB in average [26]. Moreover, the TRD 7x7 model also 
surpasses the best-reported algorithm - WNNM method, 
which is very slow as shown in Table 2. In summary, our 
TRD 7x7 model outperforms all the recent state-of-the-arts 
on the exploited test dataset, meanwhile it is the fastest 
method even with the CPU implementation. 

Run time: The algorithm structure of our TRD model is 
closely similar to the CSF model, which is well-suited for 
parallel computation on GPUs. We implemented our trained 
models on GPU using CUDA programming to speed up the 
inference procedure, and finally it indeed lead to a signifi¬ 
cantly improved run time, see Table 2. We see that for the 
images of size up to 3K x 3K, the TRD 7x7 model is still 
able to accomplish the denoising task in less than Is. 

We make a run time comparison to other denoising algo¬ 
rithms based on strictly enforced single-threaded CPU com¬ 
putation ( e.g., start Matlab with -singleCompThread) for a 
fair comparison, see Table 2. We only present the results 
of some selective algorithms, which either have the best de¬ 
noising result or run time performance. We refer to [37] 
for a comprehensive run time comparison of various algo¬ 
rithms 5 . 

We see that our TRD model is generally faster than the 
CSF model with the same model capacity. It is reasonable, 
because in each stage the CSF model involves additional 
DFT and inverse DTF operations, i.e., our model only re¬ 
quires a portion of the computation of the CSF model. Even 
though the BM3D is a non-local model, it still possesses 
high computation efficiency. In contrast, another non-local 
model - WNNM achieves compelling denoising results but 
at the expense of huge computation time. Moreover, the 
WNNM algorithm is hardly applicable for high resolution 
images ( e.g ., 10 mega-pixels) due to its huge memory re¬ 
quirements. Note that our model can be also easily imple¬ 
mented with multi-threaded CPU computation. 

4.2. JPEG deblocking experiments 

We also trained diffusion models for the JPEG deblock¬ 
ing problem. We followed the test procedure in [2< ] for 
performance evaluation. The test images were converted to 
gray-value, and scaled by a factor of 0.5, resulting images of 

5 LSSC, EPLL, opt-MRF and RTF 5 methods are much slower than 
BM3D on the CPU, cf. [37]. 


size 240 x 160. We distorted the images by JPEG blocking 
artifacts. We considered three compression quality settings 
q = 10, 20 and 30 for the JPEG encoder. 

We trained three nonlinear diffusion TRD 7X 7 models for 
different compression parameter q. We found that for JPEG 
deblocking, 4 stages are already enough. Results of the 
trained models are shown in Table 3, compared with sev¬ 
eral representative deblocking approaches. We see that our 
trained TRD 7x7 outperforms all the competing approaches 
in terms of PSNR. Furthermore, our model is extremely fast 
on GPU, e.g., for a common image size of 1024 x 1024, 
our model takes about 0.095s, while the strongest competi¬ 
tor (in terms of run time) - SADCT consumes about 56.5s 
with CPU computation 6 . See the supplemental material for 
JPEG deblocking examples. 

5. Conclusion and future work 

We have proposed a trainable reaction diffusion model 
for effective image restoration. Its critical point lies in the 
training of the influence functions. We have trained our 
models for the problem of Gaussian denoising and JPEG de¬ 
blocking. Based on standard test datasets, the trained mod¬ 
els result in the best-reported results. We believe that the 
effectiveness of the trained diffusion models is attributed to 
the following desired properties of the models 

• Anisotropy. In the trained filters, we can find rotated 
derivative filters in different directions, cf. Fig 3. 

• Higher order. The learned filters contain first, second 
and higher-order derivative filters, cf. Fig 3. 

• Adaptive forward/backward diffusion through the 
learned nonlinear functions. Nonlinear functions cor¬ 
responding to explicit backward diffusion appear in the 
learned nonlinearity, cf. Fig 1 . 

Meanwhile, the trained models are very simple and well- 
suited for parallel computation on GPUs. As a conse¬ 
quence, the resulting algorithms are significantly faster than 
all competing algorithms and hence are also applicable to 
the restoration of high resolution images. 

Future work: From a application point of view, we think 
that it will be interesting to consider learned, nonlinear reac¬ 
tion diffusion based models also for other image processing 
tasks such as image super resolution, blind image deconvo¬ 
lution, optical flow. Moreover, since learning the influence 
functions turned out to be crucial, we believe that learning 
optimal nonlinearities in CNs could lead to a similar perfor¬ 
mance increase. Finally, it will also be interesting to inves¬ 
tigate the unconventional penalty functions learned by our 
approach in usual energy minimization approaches. 

6 RTF is slower than SADCT, as it depends on the output of SADCT. 
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Abstract 

This is the supplemental material for our CVPR2015 paper entitled “On learning optimized reaction diffusion processes 
for effective image restoration”. In this supplemental material, we give detailed derivations of gradients required in training 
for corresponding diffusion networks. In addition, we also present more image denoising and JPEG deblocking examples. 


1. Preliminaries 


When we modify the original diffusion equation 


to the following version 


u t 


f Nk \ 

= ^Klut-l) + A‘(«t-1 - fn)\ 


/N k \ 

u t = U t - 1 - f k\ * <j>\(k\ * Ut- 1) + \\ut-i - fn) 


( 1 ) 

( 2 ) 


we find it introduces some imperfections at the image boundary. The basic reason lies in the fact that, in the case of symmetric 
boundary condition used in our work, K T v can be interpreted as the convolution with the kernel k l only in the central region 
of image v. This interpretation does not hold for the image boundary. However, in the diffusion equation (2), the convolution 
kernel k{ is applied to the whole image, thus bringing some artifacts at the boundary. The benefit to use the diffusion equation 
(2), rather than (1) is that the revised model is more tractable in practice, especially for training, as everything can be done 
by the convolution operation efficiently. 

In order to remove this artifacts, we pad the input image u t -\ for stage t, as well as the noisy image / n , with mirror 
reflections of itself. This operation is formulated by the sparse “padding” matrix P. After a diffusion step, we only crop the 
central region of the output image u t for usage. This operation is formulated by the sparse “cropping” matrix T. When we 
apply the matrix P T = P x T to an image u, Ptu corresponds to two operations: it first crops the central region of u, then 
pads it with mirror reflections. 

After taking into account the operation of boundary handling, the exact diffusion process is illustrate in Figure 1. There 
we have u tp = Pru t . 

In our derivations, we use the symmetric boundary condition for the convolution operation k * u (image u G M mxn , 
k G M rxr ). As we know, it is equivalent to the matrix-vector product formulation Ku , where K G R NxN is a highly 
sparse matrix and u is a column vector u G R N with N = mxn. The result k*u can also be interpreted with Uk , where 
matrix U G R NxR is constructed from image u and k is a column vector k G R r with R = r x r. This formulation is 
very helpful for the computation of the gradients of the loss function w.r.t. the kernel fc, as U T v (• v e is a column 
vector) can be explicitly interpreted as a convolution operation, which is widely used in classic convolutional neural 
networks [1]. In the following derivations, we will make use of this equivalence frequently, i.e. 9 


k*u 


Ku 


Uk. 


Recall that kernel k is obtained by rotating k 180 degrees. 
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Figure 1. Proposed nonlinear diffusion process with careful boundary handling operation. Note that u tp = Pru t . 


The derivations also require matrix calculus. We use the denominator layout notation 2 for all the derivations. 


2. Derivations of learning problem 

Given S training samples {fn\ug S ^} f =1 , where fn^ is the noisy input and is the corresponding ground truth clean 
image. Let assume the original image size is m x n. We first pad the noisy image f n with uj pixels, then the resulting 
image has size O = (m + 2uj) x (n + 2uj). We have the corresponding matrix T G R Nx ° , P G R° xN , Pt G R 0x0 and 
fnp £= G R N . 


2.1. Greedy training 


In the greedy training for stage £, we are to minimize the following loss function w.r.t the model parameters 0* = 
{A t , of stage t , 

s s 


L(9 t ) = ^£(f4 3) ,uff) = 'ElwT u[ s) - 
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( 3 ) 


s=1 


8= 1 


where 


u+ = u 


(t-i)p 


< N k 

E 

v*=i 




( 4 ) 


Note that in the training for stage t, the images U( t _i\ p are fixed, served as the input of this feed-forward step. 

As the gradient of overall loss function on the whole training datasets can be decomposed to the sum over training samples, 
in the following derivation, we only consider the case of one training sample for the sake of brevity. The basic result of the 
gradient of the loss function w.r.t the training parameters 0* = {A t , is given 


0£(^Uf , 'tigt ) 5u<* 9£(Ut , 'Ug*) 

<90* 50* du t 


( 5 ) 


where 


simply given as 


d£(ut, Ugt) 

du t 


= T T (Tu t - Ugt ) . 


Let us define a column vector e G R° as 

e = T r (Tu t - u gt ). 

Therefore, the main issue is to calculate from (4). 

Weight parameter X L : It is easy to see that 


^ - fnpV 


Thus is given as 



2 http://en.wikipedia.org/wiki/Matrix_calcuius 
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Filters k\: Concerning the dependency of u t on parameters &•, it is easy to see the following relationship 


u t -»-fcj, * fl (M * , 

f V 


where / and v are two auxiliary variables defined as / = —k\ and v = 4>\(k\ * U( t -i) p )- Therefore, we get the following 
dependency relationship, 



According to the chain rule, we have 

chH = df_ chH dv_ chn 
dk\ dk\ df dk\ dv 

Note that / = —k\, which can be formulated as / = —Pi nv k\ with matrix P inv inverting the kernel vector k\. Recall the 
equivalence 

f *v Fv V f . 


Therefore, the first term of (8) is given as 

9f_ fat _ pT yT 

dk\ df 

For the second term, we introduce an additional auxiliary variable z, defined as z = k\ * u^ t _^ p . Then we have v = 4>\(z). 
Recall that 


Therefore, we obtain 


dv dut _ dz_ ch du^ _ r/T * 
dk\ dv dk l - dz dv ( t ~ 1 )p 


where A_is a diagonal matrix A = diag(0j / (zi), • • • , c/)f (z p )) (cj)*' is the first order derivative of function 4>\). Note that 
F = — K\. In summary, is given as 


du t 

dk\ 


(pJ nv V T + Uj t _ 1)p KKf) . 


(9) 


Finally, we arrive at the desired gradients 


dk\ 


(pJ nv V T +Uj t _ l]v KKf)e. 


( 10 ) 


In practice, we do not need to explicitly construct the matrices V, U , K\. Recall that the product of matrices V T , Uj t _ Fp and 

a vector can be computed by the convolution operator [1]. As shown in a previous work [ 4 ], Kj T can also be computed by 
the convolution operation with the kernel kj with careful boundary handling. Matrix Pj nv is merely a linear operation which 
inverts the vectorized kernel k. In the case of a square kernel k, it is equivalent to the Matlab command 

Pjnyk rot90(rot90(k )). 

If we have a closer look at the diffusion equation (4), we find that it has a scaling problem w.r.t the filters k\. First we 
know the function (p\ is free to tune in the training. In this case, if we scale the filter k\ by a factor of h to generate a new 
filter k\ = hk\ , and the corresponding new function (\>\ is selected such that 4>\{hz) = ^<frl(z), then we will see that the term 

k\ * ^{kj * i) p ) keep unchanged, i.e., two different set of parameters (j)\} and {kj, cf)\} own exactly the same loss 
function £(u t , u gt ). In order to get rid of this ambiguity, it is necessary to fix the scale of the filters. In practice, we learn 
filters with fixed unit norm. Motivated by the finding in [4] that meaningful filters should be zero-mean, we also construct the 
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training filter k from the DCT basis B G R Rx ( r ^ (without the DC-component). Therefore, we define each filter k G R r 
with 

k = B^r-, (11) 


C 2 


where c G Now the training parameters become c, and we need to calculate As shown in (10), we already have 


||, according to the chain rule, we have 
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where || is computed from (11). Let us define an auxiliary variable u = we have 
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(12c) 
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where I e R f/l ’ 1 ) x ( i? x ) is the identity matrix. Combining the Equation (10) and (21), we can obtain the desired gradients 
of the loss function w.r.t the training parameter c\, given as 


(13) 


Influence functions <j>i According to diffusion equation (4), the dependency of u t on the influence function 4>\ is given as 



Let us define an auxiliary variable y e R° as 
In our work, function (j>\ is represented as 


ut ~K\ ■ • u {t _ 1)p ). 

y R-i' • 


M 


i=* v 1 

Therefore, the column vector G M° can be reformulated via a matrix equation 

4>t(y) = G(y)-w \, 

where w\ G M m is the vectorized version of parameters w\-, matrix G(y) G M° xM is given as 


(14) 

(15) 

(16) 


>(“) 




Wil 

Wi2 


<t>i ivi) 

4>i ( 2 / 2 ) 

1- 

o . . 

1 

■e 




WiM 


. 4>i ( vq ) . 


( 17 ) 


G(y) 


<f>i(y) 
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Figure 2. Function approximation via Gaussian ip g (z) or triangular-shaped cpt(z) radial basis function, respectively. 


Now, it is straightforward to obtain , given as 


dut 

dw\ 


-g t k\ 


(18) 


Then we can obtain the desired gradients of the loss function w.r.t the parameters of the influence function, written as 


d£ 

dw\ 


-G T Kfe. 


(19) 


In this paper, we investigate two typical RBFs [8]: (1) Gaussian radial basis cp g and (2) triangular-shaped radial basis ip t , 
which are given as 


= (fi 



= exp 


W) 


and 


<Pt(z) = 



l-“ l*~M l<7 
0 \z-/j,\>-y 


respectively. The basis functions are shown in Figure 2, together with an example of the function approximation by using 
two different RBF methods. 

In Figure 2, we can see that in the case of triangular-shaped RBF based function approximation any input variable z 
only involves two basis functions, i.e., each row of the G matrix (17) only has two non-zero numbers. Therefore, we can 
explicitly make use of this property in the implementation to speed up the computation of Equation (19), i.e., the 
triangular-shaped RBF based training process is generally faster than the Gaussian RBF based one. 

In the training, the first order derivative of the influence function (j> is also required as in Equation (13). In the case of 
Gaussian RBF, the first order derivative is given as 


<f>i(z) = -iX^y-exP (-|(z - Mi) 2 ) • {z ~ Mi) • 

In the case of triangular-shaped RBF, ft is defined by a piece-wise constant function as 0 is a piece-wise linear function. 
Although the influence function <j) and its derivative ft is not smooth, the training still works quite well. 

In practice, in order to speed up the computation of the function value ftz) and its derivative ft(z) for the case of Gaussian 
RBF, we approximate these functions with piece-wise linear functions (the function values at discrete points are precomputed 
and stored in a lookup-table), then the function values at point z can be retrieved efficiently using linear interpolation. 

Concerning the function approximation accuracy, in general, the Gaussian RBF can provide a better approximation for 
generally smooth function than the triangular-shaped RBF with the same number of basis functions, as the latter generates a 
piece-wise linear function for approximation. In order to improve the approximation accuracy of the triangular-shaped RBF 
based method, usually we need to exploit more basis functions relative to the Gaussian RBF based method. However, using 
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more basis functions will bring an undesired problem of over-fitting. Therefore, certain regularization technique is required. 
Unfortunately, up to now we have not figured out the best choice for the regularization term. 

In our work, we have investigated both function approximation methods, and we find that they generate similar results. 
We only present the results obtained by the Gaussian RBF due to space limitation. We do not provide a comprehensive 
comparison of two methods, as it is out of the scope of this paper. 

2.2. Joint training 

In the joint training, the parameters of all stages T are optimized simultaneously. The joint training task is formulated as 


Aei,-,r) = X)w.«S ) ) 


( 20 ) 


3 = 1 


where the loss function only depends on ut, the output of the final stage T. The gradients of the loss function w.r.t @ t is 
given as 

dl{u T ,u gt ) _ du t d£(u T , u g t) 

~ dG t ' 


dO t 


du* 


where 5^ has been already done in the preceding subsection. Now the main issue is to calculate 


know 


^ we onl y 


d£(u T ,u gt ) T 

e = au T =T <r “ T “"»* ) ' 


the standard back-propagation technique widely used in neural networks learning [1( ] can be used to calculate the desired 
gradients, which is written as 


d£(u T ,u gt ) _ dut +1 e(u T ,u g t) 


du t 


du t dut+i 
dut+l _ dut+2 _ £(u T ,U g t) 
du t du t+ 1 du t+2 

dut+i du t+2 du T 


du t du t+ 1 du T -1 


e . 


(21a) 

(21b) 

(21c) 


In practice, (21) is computed using a backward manner starting from the last stage. Now the only thing we need to calculate 
is q^ 1 • Recall the diffusion process shown in Figure 1, it is straightforward to see that 

dU -£-\-1 dUf p dU - fc -\-1 


du t 


Oui du tp 


where = P^ according to the equation u tp = Pru t , and can be obtained from the diffusion equation (4). 


du, 


■t+1 


N k 


du, 


tp 


= (1 - A t+1 )I - ^ K\ +1 ' • A, • (K\ +1 ) 


i=1 


where A i is a diagonal matrix A i = diag(^ +l/ (zi), • • • , (z p )) with 2 = &- +1 * u tp . Therefore, the overall is 

given as 

o / N k \ 

dut+l 7->T / V^r^lT * /fv#xnT 


du t 


e= Pj ■ (1 — A t+1 )I - K l +1 ■ A i ■ ( K i +1 ) 


i=1 


Then the gradients of d ^ u ^gt) can CO mputed using the backward recurrence described above. Once we have obtained 
the results of d ^ u ^ u ad , ^ [ s straightforward to calculate d ^ u ^ 9t ^ using the derivations in previous subsection. 


3. Training for JPEG deblocking 

As mentioned in the main paper, in this paper, we consider the JPEG deblocking problem by defining a new variational 
model, which incorporates the FoE image prior model and the quantization constraint set (QCS). 
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Figure 3. Schematic overview of the JPEG compression and decompression procedure 

3.1. JPEG compression and the QCS 

Figure 3 illustrates all the steps of the JPEG compression and decompression procedure. In the step of quantization, the 
transformed DCT coefficients of each 8x8 block are point-wise divided by the quantization matrix, and then the values are 
rounded to integer, which is where the loss of data takes place, as the rounding operation is a mapping of “oo —>> 1”. Given an 
integer number d , any number in the interval [d — d + \] is a possible candidate for the original number which is rounded 
to d. 

With the compressed image data, we only know the integer coefficient data (dfj)i<i,j<8> where I indicates a 8 x 8 block 
indexed by /, and the quantization matrix ( Qi,j)i<i,j<8 • Therefore, the possible original DCT coefficients, which yield (dj j) 
during the quantization and rounding procedure are given by the interval 

This result is for the block I. For the full size image, we just need to repeat this result for each distinct block. All the intervals 
Sfj associated with each 8x8 block form the so-called QCS, which is simply a box constraint determining all possible 
source images. 

3.2. Variational model for image deblocking 

In our training, an image u of size m x n is padded with uo pixels (we set cc = 8 for this problem). In order to simplify 
the notation, the interval S is represented by two column vectors a E M° and b E M° (O = (m + 2cc) x (n + 2cj)), 
which correspond to the lower and upper bounds of the intervals S[ y , respectively. We further define a highly sparse matrix 
D E M 0x0 , which makes Du equivalent to the block-wise DCT transform applied to the two-dimensional image u. 

Given the compressed image data, the QCS is given as the box constraint S = [a, 6], and the set of possible source image 
of the compression process is defined as 

U = \u E M° | ( Du)p E [CLp , bp ]} . 

Then we can define our variational model based on the FoE image prior model and QCS, which reads as 

N k 

&rgmmE(u) = Pi(ki * u) . (22) 

i=l 

This is a constrained optimization problem, and it can be rewritten as 

N k 

argmin E(u) = V" Pi(ki * u) +l s (Du ), (23) 

u z ^ 

where 

Is (Du) = |° 

I OO 

In this formulation, we exploit the convex set S instead of set U, 


if Du E 5, 
else. 

as S is a box constraint, which is simpler than U. 
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As the minimization problem (23) contains the non-smooth indicator function, the standard gradient descent algorithm is 
not applicable. Therefore, we resort to the more general proximal gradient method [11], which is applicable to solve a class 
of the following minimization problems 

min h(u) = f(u) + g(u) , (24) 

U 

where / is a smooth function and g is convex (possibly non-smooth) function. The proximal gradient method is defined as 

u t = (I + Tdg) -1 (u t -i -rV/(u t _i)), 

where r is the step size parameter, (I + rdg ) _1 denotes the proximal mapping operator. Casting the problem (23) in the form 

N k 

of (24), we have f(u) = Pi(ki * u ) and g{u) = Xs(Du). It is easy to check that 

2=1 


N k _ 

V/(u) = ^ki* * u ), 

2 = 1 

with <j>i(z) = Pi(z). We again make the modification ki to the rigorous formulation Kj. The proximal mapping with respect 
to g is given as the following minimization problem 

(/ + rdg)- 1 (u) = arg min ——+ tX$(Du ). (25) 

u 2 

As DCT is a orthogonal transform, i.e., D T D = DD T = I, then we have 

||Du - Du\\% = (u — u) T D t D(u — u) = (u — u) T (u — u) = \\u — u\\l . 


For problem (25), let 


c = Du, c = Du . 


Note that the connection between c and u (also c and u) is a mapping of one-to-one. 
It turns out that 


arg mm 


+ tXs (Du) arg mir 


TX S (c) • 


2 ' -v / ° c 2 

Obviously, the solution for the minimization problem of right side is given as the following point-wise projection onto the 
interval, i.e., QCS 

Cp if Cp G Sp = [ dp , bp] 

bp if Cp > bp (26) 

^ a p if c p < a p . 

Finally, the the solution of u is given as u = D T c. Therefore, the overall gradient descent step is given as 


u t = D T proj QCS ^u t -1 - ki * * u t -t) j j , 


(27) 


where projQ CS (-) denotes the orthogonal projection onto QCS (26). In our training model, as we consider different filters and 
influence functions for each stage t, the accurate diffusion process is given as 


N k 


U t 


= L> T proj QCS D u t - 1 * u *-i) 


(28) 


2 = 1 


The projection operator can be represented by the function g(z) show in Figure 4. Therefore, the corresponding gradient 
descent step (28) can be rewritten as the following formulas by introducing the auxiliary variables v, z, and y. 


u t = D t v 
v = rj(z) 
z = Dy 


( 29 ) 


N k 


y = u t -1 - E k l * 4>\{ k i * ut- 1) • 
2=1 
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Figure 4. The projection function rj(z). 


For the training of this new problem, the training parameters <d t is given by <d t = {</>*, k\}. According to the result of (5), 
for this new training problem, we can make use of the framework presented in the last section and we just need to recalculate 
from (29), which is given as 


du t dy dz dv du t 
dQ t d@ t dy dz dv 

= DT ' dia s(v'i z )) ■ D , 

where rj'(z) = (r}'(zi), 7 /( 22 ), • • ■ , rf(zo)) T G M° with rj'(z p ) defined as 


V ( z p ) 


1 if Zp G [dp •) bp ] , 

0 else 


(30) 


(31) 


Even though we do not consider any smoothing technique for the non-smooth function 77 , in practice we find that it is not a 
problem for the training procedure by using the above discontinuous derivative. According to the derivations (9) and (18) in 
the last section, it is straightforward to calculate Jgp, which leads to exactly the same results. 

Concerning the joint training model, we can still make use of the same framework presented in the last section and we just 
need to additionally compute Q^ t f \ • Taking into account the operation of boundary handling, ± is given as 


du t T dy dz dv du t 

du t -i T du t -1 dy dz dv 


=P+ 



* T • A,• • Kf T 


■ D t ■ diag(j/(») • D 


(32) 


Combining the derivation results of (30), (32) and the framework presented in the last section, we can reach the formulas 
required for the training of the JPEG deblocking problem. 


4. Denoising and deblocking examples 

In this section, we provide examples to illustrate the performance of our trained nonlinear diffusion processes for Gaussian 
denoising and JPEG deblocking. See Figure 5 and Figure 6 for image denoising examples for noise level a = 25 on the 
images from the test dataset. Note the differences in the highlighted regions. 

We also present the corresponding runtime either on CPU or GPU. We do not count the memory transfer time between 
CPU/GPU for both GPU implementations (if counted, the run time will nearly double). Note the speed gap between CSF 7 X7 
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and our model is not mainly caused by the different Matlab/CUDA implementations, but the lower computational expense of 
our approach. 

Figure 7 presents a denoising example on a megapixel-size natural image of size 1050 x 1680. Overall, nonlocal methods 
(BM3D and WNNM) are prone to generating artifacts. The TRD 7x7 model provides the highest PSNR value, and better 
preserves tiny image structures, such as the tree branches shown in the highlighted region. Furthermore, our method exhibits 
the best runtime. 

Figure 8 presents four JPEG deblocking examples for the compression quality q = 10. Note the effectiveness of our 
trained TRD model, meanwhile, remember that our approach is extremely fast on GPUs. 
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(a) Noisy, 20.17dB 


(b) BM3D, 28.60dB 


(c) CSF^ x7 , 28.83dB (d) WNNM, 28.82dB 


(e) TRD| x5 , 28.85dB (f) TRD^ x7 , 28.97dB 



(g) Noisy, 20.17dB (h) BM3D, 36.78dB/CPU: 2.5s (i) CSF^ x7 , 37.15dB/GPU: 0.55s 



(j) WNNM, 36.95dB/CPU: 393.2s (k) TRD| x5 , 37.04dB/GPU: 9.1ms (1) TRD^ x7 , 37.64dB/GPU: 20.3ms 



(m) Noisy, 20.17dB (n) BM3D, 33.24dB (o) CSFf x7 , 32.93dB (p) WNNM, 33.77dB (q) TRDf x5 , 32.91dB (r) TRD^, 33.34dB 

Figure 5. Denoising results on three test images (a = 25) by different methods (compared with BM3D [5], WNNM [ ] and CSF model 
[ 2]), together with the corresponding computation time either on CPU or GPU. Note that for the last image, which contains many repeated 
local patterns, e.g., the T-shirt region, the nonlocal methods (BM3D and WNNM) generally should work better than the local methods 
(CSF model and our TRD model), because the nonlocal models explicitly exploit nonlocal self-similarity across the image. Even though 
the nonlocal methods can benefit from these repeated local patterns, our trained TRD 7x7 still show strongly competitive performance. Best 
viewed magnified on screen. Note the differences in the highlighted region. 
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(d) WNNM, 27.94dB (e) TRD| x5 , 28.16dB (f) TRD^ x7 , 28.23dB 

Figure 6. Another example noise level a = 25. Note the differences in the highlighted region. 


(d) CSF? x7 , 28.08dB/GPU: 1.91s (e) WNNM, 28.20dB/CPU: 3520s (f) TRD| x7 , 28.32dBdB/GPU: 0.188s 

Figure 7. Denoising example on a high resolution “Chinese bridge” image of size 1050 x 1680 1.68 mega-pixels) for noise level a — 25. 
Compared with BM3D [5], WNNM [7] and CSF model [ 2], together with the corresponding runtime. Our TRD 7x7 model achieves the 
highest PSNR value, and it better preserves tiny image structures, e.g., the tree branches in the highlighted region. (Best viewed magnified 
on screen.) Moreover, our model offers remarkably preferable runtime performance based on GPU implementation. 


(a) Noisy, 20.17dB 


(b) BM3D, 27.53dB 


(c) CSF 7x7 , 28.00dB 


(a) Clean image 


(b) Noisy, 20.17dB 


(c) BM3D, 27.82dB/CPU: 28.1s 
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Fig-1 Clean image Fig-2 Clean image Fig-3 Clean image Fig-4 Clean image 



Fig-5 Lossy image (30.02) Fig-6 Lossy image (27.59) Fig-7 Lossy image (24.24) Fig-8 Lossy image (28.56) 



Fig-9 TGV[2] (30.44) 


Fig-10 TGV[2] (28.31) Fig-11 TGV[2] (25.12) Fig-12 TGV[2] (29.34) 



Fig-13 Die. SR[3] (30.50) Fig-14 Die. SR[3] (28.20) Fig-15 Die. SR[3] (25.11) Fig-16 Die. SR[3] (29.29) 




Fig-21 RTF [9] (30.93) Fig-22 RTF [9] (29.16) Fig-23 RTF [9] (26.37) Fig-24 RTF[9] (29.94) 



Fig-25 TRDf x7 (31.04) Fig-26 TRDf x7 (29.47) Fig-27 TRDf x7 (27.05) Fig-28 TRDf x7 (30.17) 

Figure 8. Image deblocking for images compressed by JPEG encoder with the quality q = 10. Note the differences in the sky. 
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